DETAILED ACTION
Continued Examination Under 37 CFR 1.114 
1. A request for continued examination under 37 CFR 1.114, including the fee set forth in
37 CFR 1.17(e), was filed in this application after final rejection. Since this application is 
eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) 
has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 
37 CFR 1.114. Applicant's submission filed on 4/13/2007 has been entered.

Claim Rejections - 35 USC § 103 

1. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

2. Claims 22-25, 27, 29-32, and 34 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Ezzat et al. (NPL document, "Visual Speech Synthesis by Morphing 
Visemes", herein referred to as "Ezzat") in view of Jiang et al. (NPL document, "Visual Speech 
Analysis with Application to Mandarin Speech Training", herein referred to as "Jiang") in view 
of Hon et al. (NPL document, "Automatic Generation of Synthesis Units for Trainable
Text-to-Speech Systems", herein referred to as "Hon").

As per claims 22, 23, and 30, Ezzat teaches the claimed "selecting" step at the top of the 1st
column on pg. 51, stating:
"there are many intermediate frames that lie between the chosen viseme images . . .
Consequently, we compute a series of consecutive optical flow vectors between each
intermediate image and its successor, and concatenate them all into one large flow vector
that defines the global transformation between the chosen visemes". (emphasis added)

And states in the abstract: 

we are able to synchronize the visual speech stream with the audio speech stream, and
hence give the impression of a photorealistic talking face.
(emphasis added)

Here, the visemes represent generic facial images that can be used to describe a particular
sound, and the flow vectors, which contain visual and sound features, are used in conjunction
with the visemes.

Ezzat does not explicitly teach the claimed "obtaining" step. Jiang teaches the claimed
"obtaining" step by stating in the abstract:

At each frame, region of interest is identified and 
key information is extracted. The preprocessed acoustic 
and visual information are then fed into a modular TDNN 
and combined for visual speech analysis. (emphasis added)

and by stating (pg. 114, 4.2 Acoustic and Visual Input Representation, 1st paragraph):

For acoustic data representation, we have followed 
the well-established approach to apply FFT on the Hamming 
windowed speech data to get 16 Melscale Fourier coefficients as 
input to the Acoustic input Layer. For visual data representation, 
we have performed the lip-tracking and feature points extraction 
task by applying our 2D multi-state lip shape model. Then we 
use both the color profile of the feature points on external and 
internal boundaries and position and movement of lip boundaries 
for feature extraction using principle component analysis (PCA). 
The extracted feature vectors are then fed to the Visual Input 
Layer. (emphasis added)

Here, Jiang teaches feature vectors (target feature vector), visual data (visual features),
and acoustic information (non-visual information). It would have been obvious to one of
ordinary skill in the art at the time of invention to combine Ezzat with Jiang. Jiang teaches
one advantage of obtaining feature vectors: helping children improve their speech
pronunciation by providing audio-visual feedback (see section 5, pgs. 114-115, 1st paragraph).

Application/Control Number: 10/662,550
Art Unit: 2628

Ezzat does not teach the claimed "unit selection process" and does not teach the claimed
"in which a longest possible candidate image sample is selected". Hon teaches the claimed "unit
selection process" by teaching of "Unit Selection" (title of section 4 on pg. 295) and suggests the
claimed "longest possible candidate image sample is selected" by teaching of:

If large memory resources and a large speech database are 
available, it is possible to use a multiple-instance system to 
construct long-units for frequent words and phrases that will 
undoubtedly achieve optimal concatenation quality 

(top of 1st col on pg. 296)

In this instance, the phrases "multiple-instance system" and "achieve optimal concatenation"
would suggest to one of ordinary skill in the art the concept of selecting a longest possible
candidate image sample. This is because longer candidate image samples can have more words
or sounds already connected within them, and thus less reconstruction from smaller image
samples is necessary. By avoiding these smaller reconstructions of sequences (reconstruction
through concatenation, i.e. joining samples together in a sequence) and by using one longer
sample instead, optimal quality is achieved.
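
For illustration only, the idea of preferring the longest available candidate can be sketched as a greedy longest-match over a sample database. This is a minimal sketch under stated assumptions and is not taken from Hon, Ezzat, or the claims: the function name `select_longest_units`, the phoneme labels, and the sample identifiers are all hypothetical.

```python
def select_longest_units(target, database):
    """Cover `target` (a list of phoneme labels) with recorded samples,
    always preferring the longest sample that matches at the current position.

    `database` maps a tuple of phoneme labels to a (hypothetical) sample id.
    """
    max_len = max(len(key) for key in database)
    selected = []
    i = 0
    while i < len(target):
        # Try the longest possible span first, then progressively shorter ones.
        for length in range(min(max_len, len(target) - i), 0, -1):
            key = tuple(target[i:i + length])
            if key in database:
                selected.append((key, database[key]))
                i += length
                break
        else:
            raise KeyError(f"no sample covers {target[i]!r}")
    return selected


# Hypothetical database: four single-phoneme samples and two longer samples.
database = {
    ("h",): "s1", ("e",): "s2", ("l",): "s3", ("o",): "s4",
    ("h", "e", "l"): "s5", ("l", "o"): "s6",
}
units = select_longest_units(["h", "e", "l", "l", "o"], database)
sample_ids = [sample_id for _, sample_id in units]
```

In this example the five target labels are covered by two samples ("s5" and "s6") rather than five, so only one join between samples is needed.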

It would have been obvious to one of ordinary skill in the art at the time of invention to
combine Hon with the combinable system of Ezzat and Jiang. One advantage of the combination
is that, with Hon, unit selection from a database containing a large number of candidates
can produce optimal concatenation quality (top of 1st col on pg. 296).

As per claims 24-25 and 31-32, Ezzat teaches the claimed "selecting ... using a
comparison of a combination of visual features and non-visual features with the target feature
vector" by stating on pg. 47, 2nd col, 2nd paragraph:

For any input text, we determine the appropriate sequence of viseme morphs to make,
as well as the rate of the transformations by utilizing the output of the natural language
processing unit (emphasis added)

In order to determine the appropriate sequence, the system would have to perform a comparison
of visual and non-visual features with a given target vector in order to produce the output as
stated. Further, constructing an appropriate sequence of viseme morphs would require
selecting candidate image samples between which the system could transition through
transformation.

Ezzat teaches the claimed compiling by teaching of concatenation (see quote from top of
1st column on pg. 51 above).

As per claims 27 and 34, Ezzat teaches the claimed first database by teaching of recording
and collecting one image per English phoneme (bottom of 1st column on pg. 47 under "Corpus
and Viseme Acquisition", also see figure 2).

Ezzat teaches the claimed second and third databases by teaching of a "Flow database" (pg.
54, 2nd column), which contains optical flow vectors that specify transition data between
visemes (includes visual data and includes storing non-visual data, i.e. sound transitions).

As per claim 29, Ezzat teaches the claimed first database in figure 2, and the claimed second
database and the claimed third database on pg. 54, 2nd column under "Flow database", where this
database is formed to specify visual and non-visual data between animation transitions (frames).

3. Claims 28 and 35 are rejected under 35 U.S.C. 103(a) as being unpatentable over Ezzat in
view of Jiang in further view of Hon in further view of Brand (NPL document, "Voice
Puppetry", herein referred to as "Brand").

As per claims 28 and 35, Ezzat does not teach the claimed limitations. 

Brand teaches the claimed "selecting ... a number of candidates" and the claimed
"Viterbi search" by stating on the bottom half of the 1st col on pg. 25:

The Viterbi sequence, while most likely, may only represent a small fraction of the total 
probability mass — there may be thousands of slightly different state sequences that 
are nearly as likely. If this were to happen in the voice puppet, V would be a very poor
representation of the relevant information in the audio, and the animation quality would
suffer greatly.

. . . These problems are virtually banished with entropically estimated models because 
entropy minimization concentrates the probability mass on the optimal Viterbi 
sequence. (emphasis added)

Brand teaches the claimed concatenation cost by stating on pg. 26, very bottom of 1st col
and very top of 2nd col:

We quantified this with a squared error measure of divergence between groundtruth (x) 
and reconstructed (y) facial motion vectors, weighted to penalize motions in the wrong 
direction. (emphasis added)

It would have been obvious to one of ordinary skill in the art at the time of invention to combine
Brand with the combinable system of Ezzat, Jiang, and Hon. Brand teaches the advantage of
using an optimal Viterbi sequence with a large number of state sequences (candidates) to narrow
the candidates to the most optimal ones and thereby avoid poor animation quality (1st col on
pg. 25, see quote above).
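
For illustration only, a Viterbi search over candidate samples with a squared-error concatenation cost can be sketched as follows. This is a generic dynamic-programming sketch under stated assumptions, not Brand's method (which operates on hidden Markov model state sequences) and not the claimed method: the function `viterbi_select`, the cost functions, and the integer "feature values" standing in for image samples are all hypothetical.

```python
def viterbi_select(candidates, target_cost, concat_cost):
    """Pick one candidate per position so that the sum of target costs and
    concatenation costs between neighbours is minimal.

    candidates: list (one entry per target position) of lists of candidates.
    Returns (total_cost, chosen_path).
    """
    # Each layer holds (cheapest cost of a path ending here, back-pointer).
    history = [[(target_cost(0, c), None) for c in candidates[0]]]
    for t in range(1, len(candidates)):
        layer = []
        for c in candidates[t]:
            cost, prev = min(
                (history[-1][k][0] + concat_cost(p, c), k)
                for k, p in enumerate(candidates[t - 1])
            )
            layer.append((cost + target_cost(t, c), prev))
        history.append(layer)

    # Trace back from the cheapest final candidate.
    j = min(range(len(history[-1])), key=lambda k: history[-1][k][0])
    total = history[-1][j][0]
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = history[t][j][1]
    path.reverse()
    return total, path


# Hypothetical one-dimensional features (e.g. a mouth-opening measure).
targets = [0, 10, 0]
candidates = [[0, 5], [10, 4], [0, 6]]

total, path = viterbi_select(
    candidates,
    target_cost=lambda t, c: (c - targets[t]) ** 2,  # distance to the target
    concat_cost=lambda a, b: (a - b) ** 2,           # squared error at the join
)
```

Here the search chooses the path 0, 4, 0 with total cost 68. Picking the best target match at each position independently (0, 10, 0) would cost 200 once the two joins are counted, which is why the joint search is preferred.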

Response to Arguments 
1. Applicant's arguments filed 4/13/2007 have been fully considered but they are not
persuasive. 

Applicant's primary argument relates to the motivation and reasons to combine the references of
Ezzat and Hon (pages 3 to 6 in filed response). One primary reason applicant argues that the
combination is not proper is that Hon in section 2.2 criticizes the use of diphones (top of
page 4), whereas Ezzat states that diphones are very good and are fundamental to the
entire process of Ezzat for synthesizing video (second paragraph, page 5).
The examiner acknowledges that Hon in section 2.2, criticizes the use of diphones. In one 
instance, Hon states "there can be large distortions due to the difference in spectra between the 
stationary parts of two units obtained from different contexts" (in section 2.2) in relation to the 
use of diphones. Hon elaborates that the use of senones and triphones is preferred over diphones 
(in section 2.3) because they allow for a "flexible memory-quality compromise" which has been
"well studied in the speech recognition community" (top of 1st col on page 294). However, the
examiner respectfully maintains that the rejections are proper for at least two reasons. First, the 
prior art rejections of record presented in the office action do not explicitly rely upon Hon for 
teaching diphones but rather only the unit selection process. The purpose behind the cited unit 
selection process is primarily to pick a best candidate instance in order to achieve the best quality 
output. The unit selection process is cited by the examiner to stress the process of selecting an 
instance for better concatenation from a database or pool of possible instances. The unit
selection process is not cited to emphasize the use of diphone alternatives. Second, according to
the disclosure of section 2.2 in Hon, the technology used in place of diphones enhances the use of
diphones and is also well known in the art. The fact that this technology is well known in the art
would support the conclusion that one of ordinary skill can combine the references of Ezzat and
Hon into a workable combination. One of ordinary skill in the art would achieve the
combination by using the improved technology of Hon to improve Ezzat. Furthermore, the 
enhanced qualities achieved beyond diphones would even motivate one of ordinary skill in the 
art to build the combination in order to achieve a higher quality output. For at least these 
reasons, the examiner respectfully maintains that the combination of prior references is proper. 

Conclusion 

2. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time 
policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within TWO 
MONTHS of the mailing date of this final action and the advisory action is not mailed until after 
the end of the THREE-MONTH shortened statutory period, then the shortened statutory period 
will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 
CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, 
however, will the statutory period for reply expire later than SIX MONTHS from the mailing 
date of this final action. 

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Daniel F. Hajnik whose telephone number is (571) 272-7642.
The examiner can normally be reached on Mon-Fri (8:30A-5:00P).

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, Ulka J. Chauhan, can be reached on (571) 272-7782. The fax phone number for the
organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent
Application Information Retrieval (PAIR) system. Status information for published applications
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished
applications is available through Private PAIR only. For more information about the PAIR
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would
like assistance from a USPTO Customer Service Representative or access to the automated
information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Supervisor Patent Examiner



